ulcerative colitis
Rethinking Retrieval-Augmented Generation for Medicine: A Large-Scale, Systematic Expert Evaluation and Practical Insights
Kim, Hyunjae, Sohn, Jiwoong, Gilson, Aidan, Cochran-Caggiano, Nicholas, Applebaum, Serina, Jin, Heeju, Park, Seihee, Park, Yujin, Park, Jiyeong, Choi, Seoyoung, Contreras, Brittany Alexandra Herrera, Huang, Thomas, Yun, Jaehoon, Wei, Ethan F., Jiang, Roy, Colucci, Leah, Lai, Eric, Dave, Amisha, Guo, Tuo, Singer, Maxwell B., Koo, Yonghoe, Adelman, Ron A., Zou, James, Taylor, Andrew, Cohan, Arman, Xu, Hua, Chen, Qingyu
Large language models (LLMs) are transforming the landscape of medicine, yet two fundamental challenges persist: keeping up with rapidly evolving medical knowledge and providing verifiable, evidence-grounded reasoning. Retrieval-augmented generation (RAG) has been widely adopted to address these limitations by supplementing model outputs with retrieved evidence. However, whether RAG reliably achieves these goals remains unclear. Here, we present the most comprehensive expert evaluation of RAG in medicine to date. Eighteen medical experts contributed a total of 80,502 annotations, assessing 800 model outputs generated by GPT-4o and Llama-3.1-8B across 200 real-world patient and USMLE-style queries. We systematically decomposed the RAG pipeline into three components: (i) evidence retrieval (relevance of retrieved passages), (ii) evidence selection (accuracy of evidence usage), and (iii) response generation (factuality and completeness of outputs). Contrary to expectation, standard RAG often degraded performance: only 22% of top-16 passages were relevant, evidence selection remained weak (precision 41-43%, recall 27-49%), and factuality and completeness dropped by up to 6% and 5%, respectively, compared with non-RAG variants. Retrieval and evidence selection remain key failure points for the model, contributing to the overall performance drop. We further show that simple yet effective strategies, including evidence filtering and query reformulation, substantially mitigate these issues, improving performance on MedMCQA and MedXpertQA by up to 12% and 8.2%, respectively. These findings call for re-examining RAG's role in medicine and highlight the importance of stage-aware evaluation and deliberate system design for reliable medical LLM applications.
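The evidence-selection figures quoted above are precision and recall values. As a minimal sketch of one plausible way to compute them, assuming each retrieved passage has an identifier and experts mark which passages are relevant (the function name and the example passage IDs are hypothetical, not taken from the paper):

```python
def selection_precision_recall(selected, relevant):
    """Precision/recall of passages the model used vs. expert-marked relevant passages.

    Both arguments are iterables of passage IDs (a hypothetical representation).
    """
    selected, relevant = set(selected), set(relevant)
    hits = len(selected & relevant)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative data only: the model uses 4 passages, 2 of which are among 5 relevant ones.
p, r = selection_precision_recall(
    ["p1", "p2", "p3", "p4"],
    ["p1", "p2", "p7", "p8", "p9"],
)
print(f"precision={p:.2f} recall={r:.2f}")
```

Low precision here means the model leans on passages experts judged irrelevant; low recall means it ignores relevant passages that were retrieved.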
- Europe > Austria > Vienna (0.14)
- Europe > Switzerland > Zürich > Zürich (0.14)
- Asia > South Korea > Seoul > Seoul (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
Lesion-Aware Visual-Language Fusion for Automated Image Captioning of Ulcerative Colitis Endoscopic Examinations
Escamilla, Alexis Ivan Lopez, Ochoa, Gilberto, Ali, Sharib
We present a lesion-aware image captioning framework for ulcerative colitis (UC). The model integrates ResNet embeddings, Grad-CAM heatmaps, and CBAM-enhanced attention with a T5 decoder. Clinical metadata (MES score 0-3, vascular pattern, bleeding, erythema, friability, ulceration) is injected as natural-language prompts to guide caption generation. The system produces structured, interpretable descriptions aligned with clinical practice and provides MES classification and lesion tags. Compared with baselines, our approach improves caption quality and MES classification accuracy, supporting reliable endoscopic reporting.
- North America > Mexico (0.05)
- Europe > United Kingdom > England > West Yorkshire > Leeds (0.04)
- Europe > France (0.04)
- Health & Medicine > Therapeutic Area > Gastroenterology (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (0.98)
Diagnosis and Severity Assessment of Ulcerative Colitis using Self Supervised Learning
Ulcerative Colitis (UC) is an incurable inflammatory bowel disease that leads to ulcers along the large intestine and rectum. The increase in the prevalence of UC coupled with gastrointestinal physician shortages stresses the healthcare system and limits the care UC patients receive. A colonoscopy is performed to diagnose UC and assess its severity based on the Mayo Endoscopic Score (MES). The MES ranges between zero and three, wherein zero indicates no inflammation and three indicates that the inflammation is markedly high. Artificial Intelligence (AI)-based neural network models, such as convolutional neural networks (CNNs), are capable of analyzing colonoscopies to diagnose and determine the severity of UC by modeling colonoscopy analysis as a multi-class classification problem. Prior research for AI-based UC diagnosis relies on supervised learning approaches that require large annotated datasets to train the CNNs. However, creating such datasets necessitates that domain experts invest a significant amount of time, rendering the process expensive and challenging. To address the challenge, this research employs self-supervised learning (SSL) frameworks that can efficiently train on unannotated datasets to analyze colonoscopies and aid in diagnosing UC and its severity. A comparative analysis with supervised learning models shows that SSL frameworks, such as SwAV and SparK, outperform supervised learning models on the LIMUC dataset, the largest publicly available annotated dataset of colonoscopy images for UC.
- South America > Uruguay > Maldonado > Maldonado (0.04)
- North America > United States > Virginia (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Research Report > Experimental Study (0.67)
- Research Report > New Finding (0.46)
- Health & Medicine > Therapeutic Area > Oncology > Colorectal Cancer (1.00)
- Health & Medicine > Therapeutic Area > Gastroenterology (1.00)
Arges: Spatio-Temporal Transformer for Ulcerative Colitis Severity Assessment in Endoscopy Videos
Chaitanya, Krishna, Damasceno, Pablo F., Fadnavis, Shreyas, Mobadersany, Pooya, Parmar, Chaitanya, Scherer, Emily, Zemlianskaia, Natalia, Surace, Lindsey, Ghanem, Louis R., Cula, Oana Gabriela, Mansi, Tommaso, Standish, Kristopher
Accurate assessment of disease severity from endoscopy videos in ulcerative colitis (UC) is crucial for evaluating drug efficacy in clinical trials. Severity is often measured by the Mayo Endoscopic Subscore (MES) and Ulcerative Colitis Endoscopic Index of Severity (UCEIS) score. However, expert MES/UCEIS annotation is time-consuming and susceptible to inter-rater variability, factors addressable by automation. Automation attempts with frame-level labels face challenges in fully-supervised solutions due to the prevalence of video-level labels in clinical trials. CNN-based weakly-supervised models (WSL) with end-to-end (e2e) training lack generalization to new disease scores and ignore spatio-temporal information crucial for accurate scoring. To address these limitations, we propose "Arges", a deep learning framework that utilizes a transformer with positional encoding to incorporate spatio-temporal information from frame features to estimate disease severity scores in endoscopy video. Extracted features are derived from a foundation model (ArgesFM), pre-trained on a large diverse dataset from multiple clinical trials (61M frames, 3927 videos). We evaluate four UC disease severity scores, including MES and three UCEIS component scores. Test set evaluation indicates significant improvements, with F1 scores increasing by 4.1% for MES and 18.8%, 6.6%, 3.8% for the three UCEIS component scores compared to state-of-the-art methods. Prospective validation on previously unseen clinical trial data further demonstrates the model's successful generalization.
K-QA: A Real-World Medical Q&A Benchmark
Manes, Itay, Ronn, Naama, Cohen, David, Ber, Ran Ilan, Horowitz-Kugler, Zehavi, Stanovsky, Gabriel
Ensuring the accuracy of responses provided by large language models (LLMs) is crucial, particularly in clinical settings where incorrect information may directly impact patient health. To address this challenge, we construct K-QA, a dataset containing 1,212 patient questions originating from real-world conversations held on K Health (an AI-driven clinical platform). We employ a panel of in-house physicians to answer and manually decompose a subset of K-QA into self-contained statements. Additionally, we formulate two NLI-based evaluation metrics approximating recall and precision: (1) comprehensiveness, measuring the percentage of essential clinical information in the generated answer, and (2) hallucination rate, measuring the number of statements from the physician-curated response contradicted by the LLM answer. Finally, we use K-QA along with these metrics to evaluate several state-of-the-art models, as well as the effect of in-context learning and medically-oriented augmented retrieval schemes developed by the authors. Our findings indicate that in-context learning improves the comprehensiveness of the models, and augmented retrieval is effective in reducing hallucinations. We make K-QA available to the community to spur research into medically accurate NLP applications.
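The two metrics described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: it assumes an NLI model has already labelled each physician-written statement against the generated answer as "entailment", "neutral", or "contradiction", and that each statement carries a flag for whether it is essential. The function name and data layout are hypothetical.

```python
def kqa_style_scores(statements):
    """Compute (comprehensiveness, hallucination_count) for one generated answer.

    statements: list of (is_essential, nli_label) pairs, one per physician
    statement; nli_label is assumed to be precomputed by an NLI model.
    """
    essential = [label for is_essential, label in statements if is_essential]
    entailed = sum(label == "entailment" for label in essential)
    # Share of essential statements that the generated answer covers.
    comprehensiveness = entailed / len(essential) if essential else 0.0
    # Number of physician statements the generated answer contradicts.
    hallucinations = sum(label == "contradiction" for _, label in statements)
    return comprehensiveness, hallucinations


# Illustrative labels only.
example = [
    (True, "entailment"),
    (True, "neutral"),
    (False, "contradiction"),
    (True, "entailment"),
    (True, "contradiction"),
]
print(kqa_style_scores(example))
```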
- South America > Brazil (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Indiana > Hamilton County > Carmel (0.04)
Artificial Intelligence Helps Predict Ulcerative Colitis Flare-ups, Prognosis
Iacucci and her colleagues recruited patients from 11 international centers between September 2016 and November 2019. Eligible participants had a confirmed diagnosis of ulcerative colitis for at least one year without regard to disease activity and an indication for a colonoscopy. At least two tissue samples were obtained from the rectum and the sigmoid because they are common areas representative of healing and inflammation. The endoscopic exam was recorded in the same area. Clinical outcomes used as proxies for disease flare-ups for the purpose of prognosis assessment included ulcerative colitis-related hospitalizations or surgery and increase in initiation of or changes in ulcerative colitis treatments, such as immunomodulators, biologics, or steroids, due to worsening symptoms.
- Health & Medicine > Therapeutic Area > Immunology (1.00)
- Health & Medicine > Therapeutic Area > Gastroenterology (1.00)
Gastrointestinal Disorder Detection with a Transformer Based Approach
Hosain, A. K. M. Salman, Islam, Mynul, Mehedi, Md Humaion Kabir, Kabir, Irteza Enan, Khan, Zarin Tasnim
Accurate disease categorization using endoscopic images is a significant problem in gastroenterology. This paper describes a technique for assisting medical diagnosis procedures and identifying gastrointestinal tract disorders based on the categorization of characteristics taken from endoscopic pictures using a vision transformer and transfer learning model. Vision transformers have shown very promising results on difficult image classification tasks. In this paper, we propose a vision transformer based approach to detect gastrointestinal diseases from curated wireless capsule endoscopy (WCE) images of the colon with an accuracy of 95.63%. We compare this transformer based approach with the pretrained convolutional neural network (CNN) model DenseNet201 and demonstrate that the vision transformer surpasses DenseNet201 on various quantitative performance evaluation metrics.
- Asia > Bangladesh > Dhaka Division > Dhaka District > Dhaka (0.05)
- North America > United States > New York > Monroe County > Rochester (0.04)
- Health & Medicine > Therapeutic Area > Gastroenterology (1.00)
- Health & Medicine > Diagnostic Medicine (1.00)
Patch-level instance-group discrimination with pretext-invariant learning for colitis scoring
Xu, Ziang, Ali, Sharib, Gupta, Soumya, Leedham, Simon, East, James E, Rittscher, Jens
Inflammatory bowel disease (IBD), in particular ulcerative colitis (UC), is graded by endoscopists, and this assessment is the basis for risk stratification and therapy monitoring. Presently, endoscopic characterisation is largely operator-dependent, sometimes leading to undesirable clinical outcomes for patients with IBD. We focus on the Mayo Endoscopic Scoring (MES) system, which is widely used but requires the reliable identification of subtle changes in mucosal inflammation. Most existing deep learning classification methods cannot detect these fine-grained changes, which makes UC grading such a challenging task. In this work, we introduce a novel patch-level instance-group discrimination with pretext-invariant representation learning (PLD-PIRL) for self-supervised learning (SSL). Our experiments demonstrate both improved accuracy and robustness compared to the baseline supervised network and several state-of-the-art SSL methods. Compared to the baseline (ResNet50) supervised classification, our proposed PLD-PIRL obtained an improvement of 4.75% on hold-out test data and 6.64% on unseen center test data for top-1 accuracy.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.15)
- Europe > United Kingdom > England > West Yorkshire > Leeds (0.04)
Class Distance Weighted Cross-Entropy Loss for Ulcerative Colitis Severity Estimation
Polat, Gorkem, Ergenc, Ilkay, Kani, Haluk Tarik, Alahdab, Yesim Ozen, Atug, Ozlen, Temizel, Alptekin
The Endoscopic Mayo score and the Ulcerative Colitis Endoscopic Index of Severity are commonly used scoring systems for the assessment of endoscopic severity of ulcerative colitis. They are based on assigning a score in relation to the disease activity, which creates a rank among the levels, making it an ordinal regression problem. However, most studies train their deep learning models with a categorical cross-entropy loss function, which is not optimal for ordinal regression. In this study, we propose a novel loss function called class distance weighted cross-entropy (CDW-CE) that respects the order of the classes and takes the distance between classes into account when calculating the cost. Experimental evaluations show that CDW-CE outperforms both conventional categorical cross-entropy and the CORN framework, which is designed for ordinal regression problems. In addition, CDW-CE does not require any modifications at the output layer and is compatible with class activation map visualization techniques.
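The abstract does not state the loss formula, so the following is only a sketch of one form consistent with its description: probability mass placed on a wrong class is penalized, and the penalty is weighted by that class's distance from the true class raised to a power. The exponent `alpha`, the `eps` clamp, and the function name are assumptions made for illustration.

```python
import math


def cdw_ce(probs, true_class, alpha=2.0, eps=1e-12):
    """Distance-weighted cross-entropy for one sample (assumed form).

    probs: predicted class probabilities (e.g. softmax output over MES 0-3).
    Each class i contributes -|i - true_class|**alpha * log(1 - probs[i]),
    so confident mistakes far from the true class cost the most.
    """
    loss = 0.0
    for i, p in enumerate(probs):
        weight = abs(i - true_class) ** alpha
        loss -= weight * math.log(max(1.0 - p, eps))  # clamp avoids log(0)
    return loss


# Same confidence in a wrong class, but at different distances from true class 0.
near_miss = cdw_ce([0.1, 0.7, 0.1, 0.1], true_class=0)
far_miss = cdw_ce([0.1, 0.1, 0.1, 0.7], true_class=0)
print(near_miss < far_miss)
```

Because the weighting acts only on the probabilities, a loss of this shape leaves the output layer as an ordinary softmax, which matches the abstract's remark that no output-layer modification is needed.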
- Asia > Middle East > Republic of Türkiye > Ankara Province > Ankara (0.04)
- North America > United States > Iowa > Johnson County > Iowa City (0.04)
- North America > Canada > Quebec > Montreal (0.04)
Automatic Estimation of Ulcerative Colitis Severity from Endoscopy Videos using Ordinal Multi-Instance Learning
Schwab, Evan, Cula, Gabriela Oana, Standish, Kristopher, Yip, Stephen S. F., Stojmirovic, Aleksandar, Ghanem, Louis, Chehoud, Christel
Ulcerative colitis (UC) is a chronic inflammatory bowel disease characterized by relapsing inflammation of the large intestine. The severity of UC is often represented by the Mayo Endoscopic Subscore (MES) which quantifies mucosal disease activity from endoscopy videos. In clinical trials, an endoscopy video is assigned an MES based upon the most severe disease activity observed in the video. For this reason, severe inflammation spread throughout the colon will receive the same MES as an otherwise healthy colon with severe inflammation restricted to a small, localized segment. Therefore, the extent of disease activity throughout the large intestine, and overall response to treatment, may not be completely captured by the MES. In this work, we aim to automatically estimate UC severity for each frame in an endoscopy video to provide a higher resolution assessment of disease activity throughout the colon. Because annotating severity at the frame-level is expensive, labor-intensive, and highly subjective, we propose a novel weakly supervised, ordinal classification method to estimate frame severity from video MES labels alone. Using clinical trial data, we first achieved 0.92 and 0.90 AUC for predicting mucosal healing and remission of UC, respectively. Then, for severity estimation, we demonstrate that our models achieve substantial Cohen's Kappa agreement with ground truth MES labels, comparable to the inter-rater agreement of expert clinicians. These findings indicate that our framework could serve as a foundation for novel clinical endpoints, based on a more localized scoring system, to better evaluate UC drug efficacy in clinical trials.
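The limitation described above, that a video-level MES reflects only the most severe activity observed, can be illustrated with a small sketch. The per-frame scores are made up for illustration, and both helper functions are hypothetical; they are not the paper's method, which estimates frame severity from video labels.

```python
def video_mes(frame_scores):
    """Video-level MES as the most severe frame-level score (max aggregation)."""
    return max(frame_scores)


def fraction_at_or_above(frame_scores, threshold):
    """Share of frames with severity >= threshold: a simple measure of disease extent."""
    return sum(s >= threshold for s in frame_scores) / len(frame_scores)


# Illustrative frame-level scores for two hypothetical videos.
diffuse = [3, 3, 2, 3, 3, 3]    # severe inflammation throughout
localized = [0, 0, 0, 3, 0, 0]  # one severe segment in an otherwise healthy colon

print(video_mes(diffuse), video_mes(localized))
print(fraction_at_or_above(diffuse, 3), fraction_at_or_above(localized, 3))
```

Both videos receive the same video-level score, while the frame-level view separates them by extent, which is the motivation for per-frame severity estimation.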
- Health & Medicine > Therapeutic Area > Gastroenterology (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.47)